In this paper, we propose PanoViT, a panorama vision transformer to estimate the room layout from a single panoramic image. Compared to CNN models, our PanoViT is more proficient in learning global information from the panoramic image for the estimation of complex room layouts. Considering the difference between a perspective image and an equirectangular image, we design a novel recurrent position embedding and a patch sampling method for the processing of panoramic images. In addition to extracting global information, PanoViT also includes a frequency-domain edge enhancement module and a 3D loss to extract local geometric features in a panoramic image. Experimental results on several datasets demonstrate that our method outperforms state-of-the-art solutions in room layout prediction accuracy.
translated by 谷歌翻译
Recently, there has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we first examined the existing synthesized datasets and discovered that state-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data. We observed two shortcomings: illogical synthetic SQL queries from independent column sampling and arbitrary table joins. To address these issues, we propose a novel synthesis framework that incorporates key relationships from schema, imposes strong typing, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated natural language questions. When existing powerful semantic parsers are pre-finetuned on our high-quality synthesized data, our experiments show that these models have significant accuracy boosts on popular benchmarks, including new state-of-the-art performance on Spider.
translated by 谷歌翻译
Learning 3D human pose prior is essential to human-centered AI. Here, we present GFPose, a versatile framework to model plausible 3D human poses for various applications. At the core of GFPose is a time-dependent score network, which estimates the gradient on each body joint and progressively denoises the perturbed 3D human pose to match a given task specification. During the denoising process, GFPose implicitly incorporates pose priors in gradients and unifies various discriminative and generative tasks in an elegant framework. Despite the simplicity, GFPose demonstrates great potential in several downstream tasks. Our experiments empirically show that 1) as a multi-hypothesis pose estimator, GFPose outperforms existing SOTAs by 20% on Human3.6M dataset. 2) as a single-hypothesis pose estimator, GFPose achieves comparable results to deterministic SOTAs, even with a vanilla backbone. 3) GFPose is able to produce diverse and realistic samples in pose denoising, completion and generation tasks. Project page https://sites.google.com/view/gfpose/
translated by 谷歌翻译
Graph anomaly detection (GAD) is a vital task in graph-based machine learning and has been widely applied in many real-world applications. The primary goal of GAD is to capture anomalous nodes from graph datasets, which evidently deviate from the majority of nodes. Recent methods have paid attention to various scales of contrastive strategies for GAD, i.e., node-subgraph and node-node contrasts. However, they neglect the subgraph-subgraph comparison information which the normal and abnormal subgraph pairs behave differently in terms of embeddings and structures in GAD, resulting in sub-optimal task performance. In this paper, we fulfill the above idea in the proposed multi-view multi-scale contrastive learning framework with subgraph-subgraph contrast for the first practice. To be specific, we regard the original input graph as the first view and generate the second view by graph augmentation with edge modifications. With the guidance of maximizing the similarity of the subgraph pairs, the proposed subgraph-subgraph contrast contributes to more robust subgraph embeddings despite of the structure variation. Moreover, the introduced subgraph-subgraph contrast cooperates well with the widely-adopted node-subgraph and node-node contrastive counterparts for mutual GAD performance promotions. Besides, we also conduct sufficient experiments to investigate the impact of different graph augmentation approaches on detection performance. The comprehensive experimental results well demonstrate the superiority of our method compared with the state-of-the-art approaches and the effectiveness of the multi-view subgraph pair contrastive strategy for the GAD task.
translated by 谷歌翻译
Salient object detection (SOD) focuses on distinguishing the most conspicuous objects in the scene. However, most related works are based on RGB images, which lose massive useful information. Accordingly, with the maturity of thermal technology, RGB-T (RGB-Thermal) multi-modality tasks attain more and more attention. Thermal infrared images carry important information which can be used to improve the accuracy of SOD prediction. To accomplish it, the methods to integrate multi-modal information and suppress noises are critical. In this paper, we propose a novel network called Interactive Context-Aware Network (ICANet). It contains three modules that can effectively perform the cross-modal and cross-scale fusions. We design a Hybrid Feature Fusion (HFF) module to integrate the features of two modalities, which utilizes two types of feature extraction. The Multi-Scale Attention Reinforcement (MSAR) and Upper Fusion (UF) blocks are responsible for the cross-scale fusion that converges different levels of features and generate the prediction maps. We also raise a novel Context-Aware Multi-Supervised Network (CAMSNet) to calculate the content loss between the prediction and the ground truth (GT). Experiments prove that our network performs favorably against the state-of-the-art RGB-T SOD methods.
translated by 谷歌翻译
Positioning with one inertial measurement unit and one ranging sensor is commonly thought to be feasible only when trajectories are in certain patterns ensuring observability. For this reason, to pursue observable patterns, it is required either exciting the trajectory or searching key nodes in a long interval, which is commonly highly nonlinear and may also lack resilience. Therefore, such a positioning approach is still not widely accepted in real-world applications. To address this issue, this work first investigates the dissipative nature of flying robots considering aerial drag effects and re-formulates the corresponding positioning problem, which guarantees observability almost surely. On this basis, a dimension-reduced wriggling estimator is proposed accordingly. This estimator slides the estimation horizon in a stepping manner, and output matrices can be approximately evaluated based on the historical estimation sequence. The computational complexity is then further reduced via a dimension-reduction approach using polynomial fittings. In this way, the states of robots can be estimated via linear programming in a sufficiently long interval, and the degree of observability is thereby further enhanced because an adequate redundancy of measurements is available for each estimation. Subsequently, the estimator's convergence and numerical stability are proven theoretically. Finally, both indoor and outdoor experiments verify that the proposed estimator can achieve decimeter-level precision at hundreds of hertz per second, and it is resilient to sensors' failures. Hopefully, this study can provide a new practical approach for self-localization as well as relative positioning of cooperative agents with low-cost and lightweight sensors.
translated by 谷歌翻译
Artificial Intelligence (AI) is having a tremendous impact across most areas of science. Applications of AI in healthcare have the potential to improve our ability to detect, diagnose, prognose, and intervene on human disease. For AI models to be used clinically, they need to be made safe, reproducible and robust, and the underlying software framework must be aware of the particularities (e.g. geometry, physiology, physics) of medical data being processed. This work introduces MONAI, a freely available, community-supported, and consortium-led PyTorch-based framework for deep learning in healthcare. MONAI extends PyTorch to support medical data, with a particular focus on imaging, and provide purpose-specific AI model architectures, transformations and utilities that streamline the development and deployment of medical AI models. MONAI follows best practices for software-development, providing an easy-to-use, robust, well-documented, and well-tested software framework. MONAI preserves the simple, additive, and compositional approach of its underlying PyTorch libraries. MONAI is being used by and receiving contributions from research, clinical and industrial teams from around the world, who are pursuing applications spanning nearly every aspect of healthcare.
translated by 谷歌翻译
激光点云(LPC)的非均匀分布和极稀疏的性质给其高效压缩带来了重大挑战。本文提出了一个新颖的端到端,完全物质的深层框架,该框架将原始LPC编码为OCTREE结构,并分层分解OCTREE熵模型。所提出的框架利用层次的潜在变量作为侧面信息来封装兄弟姐妹和祖先依赖性,该依赖性为点云分布的建模提供了足够的上下文信息,同时启用了同一层中的Octree节点的并行编码和解码。此外,我们提出了一个用于压缩潜在变量的残留编码框架,该框架通过渐进的下采样探索了每一层的空间相关性,并用完全属于熵模型对相应的残差进行建模。此外,我们提出了剩余编码的软添加和减法,以提高网络灵活性。 LIDAR基准Semantickitti和MPEG指定数据集福特的综合实验结果表明,我们提出的框架在所有以前的LPC框架中都实现了最先进的性能。此外,我们的端到端,完全物质化的框架被实验证明是高平行和及时效率的,并且与以前的LPC压缩方法相比,与以前的最新方法相比,可以节省超过99.8%的解码时间。
translated by 谷歌翻译
重要性采样(IS)是非政策评估中的一种流行技术,它重新赋予了重播缓冲液中轨迹的回归以提高样本效率。但是,对IS进行培训可能是不稳定的,以前试图解决此问题的尝试主要集中于分析IS的差异。在本文中,我们揭示了不稳定性与IS的重复使用偏见的新概念有关 - 由重复使用缓冲液重用进行评估和优化引起的非政策评估偏差。从理论上讲,我们证明了对当前策略的非政策评估和优化,并通过重播缓冲区的数据导致目标高估,这可能会导致错误的梯度更新并退化性能。我们进一步提供了重复使用偏差的高概率上限,并表明控制上限的一个项可以通过引入非政策算法的稳定性概念来控制重复使用偏置。基于这些分析,我们最终提出了一种新颖的偏见调查重要性抽样(BIRIS)框架以及实际算法,可以减轻重复使用偏见的负面影响。实验结果表明,我们基于BIRIS的方法可以显着提高一系列连续控制任务的样品效率。
translated by 谷歌翻译
如今,基础模型已成为人工智能中的基本基础设施之一,铺平了通往通用情报的方式。但是,现实提出了两个紧急挑战:现有的基础模型由英语社区主导;用户通常会获得有限的资源,因此不能总是使用基础模型。为了支持中文社区的发展,我们介绍了一个名为Fengshenbang的开源项目,该项目由认知计算与自然语言研究中心(CCNL)领导。我们的项目具有全面的功能,包括大型预培训模型,用户友好的API,基准,数据集等。我们将所有这些都包装在三个子项目中:风水次模型,风水框架和狂热基准。 Fengshenbang的开源路线图旨在重新评估中国预培训的大型大型模型的开源社区,促使整个中国大型模型社区的发展。我们还希望构建一个以用户为中心的开源生态系统,以允许个人访问所需的模型以匹配其计算资源。此外,我们邀请公司,大学和研究机构与我们合作建立大型开源模型的生态系统。我们希望这个项目将成为中国认知情报的基础。
translated by 谷歌翻译